The BARB viewing figures for the local news 6:30 bulletin for BBC England from 2017.
| year | average weekly viewers |
|---|---|
| 2017 | 11,000,000 |
| 2018 | 10,000,000 |
| 2019 | 9,900,000 |
| 2020 | 11,000,000 |
| 2021 | 10,000,000 |
| 2022 | 8,600,000 |
| year | week_commencing | viewers |
|---|---|---|
| 2017 | 2017-01-02 | 11,000,000 |
| 2017 | 2017-01-09 | 13,000,000 |
| 2017 | 2017-01-16 | 12,000,000 |
| 2017 | 2017-01-23 | 12,000,000 |
| 2017 | 2017-01-30 | 12,000,000 |
| 2017 | 2017-02-06 | 11,000,000 |
| 2017 | 2017-02-13 | 11,000,000 |
| 2017 | 2017-02-20 | 12,000,000 |
| 2017 | 2017-02-27 | 12,000,000 |
| 2017 | 2017-03-06 | 11,000,000 |
| 2017 | 2017-03-20 | 13,000,000 |
| 2017 | 2017-03-27 | 11,000,000 |
| 2017 | 2017-04-03 | 10,000,000 |
| 2017 | 2017-04-10 | 9,500,000 |
| 2017 | 2017-04-17 | 9,700,000 |
| 2017 | 2017-04-24 | 11,000,000 |
| 2017 | 2017-05-01 | 10,000,000 |
| 2017 | 2017-05-08 | 11,000,000 |
| 2017 | 2017-05-15 | 11,000,000 |
| 2017 | 2017-05-22 | 11,000,000 |
| 2017 | 2017-05-29 | 11,000,000 |
| 2017 | 2017-06-05 | 12,000,000 |
| 2017 | 2017-06-12 | 11,000,000 |
| 2017 | 2017-06-19 | 11,000,000 |
| 2017 | 2017-06-26 | 11,000,000 |
| 2017 | 2017-07-03 | 9,300,000 |
| 2017 | 2017-07-10 | 11,000,000 |
| 2017 | 2017-07-17 | 10,000,000 |
| 2017 | 2017-07-24 | 9,800,000 |
| 2017 | 2017-07-31 | 10,000,000 |
| 2017 | 2017-08-07 | 10,000,000 |
| 2017 | 2017-08-14 | 10,000,000 |
| 2017 | 2017-08-21 | 11,000,000 |
| 2017 | 2017-08-28 | 9,600,000 |
| 2017 | 2017-09-04 | 11,000,000 |
| 2017 | 2017-09-11 | 10,000,000 |
| 2017 | 2017-09-18 | 11,000,000 |
| 2017 | 2017-09-25 | 9,900,000 |
| 2017 | 2017-10-02 | 11,000,000 |
| 2017 | 2017-10-09 | 11,000,000 |
| 2017 | 2017-10-16 | 11,000,000 |
| 2017 | 2017-10-23 | 10,000,000 |
| 2017 | 2017-10-30 | 11,000,000 |
| 2017 | 2017-11-06 | 11,000,000 |
| 2017 | 2017-11-13 | 11,000,000 |
| 2017 | 2017-11-20 | 11,000,000 |
| 2017 | 2017-11-27 | 11,000,000 |
| 2017 | 2017-12-04 | 11,000,000 |
| 2017 | 2017-12-11 | 11,000,000 |
| 2017 | 2017-12-18 | 12,000,000 |
| 2018 | 2018-01-08 | 12,000,000 |
| 2018 | 2018-01-15 | 12,000,000 |
| 2018 | 2018-01-22 | 12,000,000 |
| 2018 | 2018-01-29 | 11,000,000 |
| 2018 | 2018-02-05 | 11,000,000 |
| 2018 | 2018-02-12 | 11,000,000 |
| 2018 | 2018-02-19 | 11,000,000 |
| 2018 | 2018-02-26 | 11,000,000 |
| 2018 | 2018-03-05 | 14,000,000 |
| 2018 | 2018-03-12 | 11,000,000 |
| 2018 | 2018-03-26 | 11,000,000 |
| 2018 | 2018-04-02 | 9,500,000 |
| 2018 | 2018-04-09 | 10,000,000 |
| 2018 | 2018-04-16 | 10,000,000 |
| 2018 | 2018-04-23 | 10,000,000 |
| 2018 | 2018-04-30 | 11,000,000 |
| 2018 | 2018-05-07 | 9,900,000 |
| 2018 | 2018-05-14 | 8,800,000 |
| 2018 | 2018-05-21 | 9,900,000 |
| 2018 | 2018-05-28 | 9,800,000 |
| 2018 | 2018-06-04 | 9,200,000 |
| 2018 | 2018-06-11 | 9,600,000 |
| 2018 | 2018-06-18 | 9,000,000 |
| 2018 | 2018-07-02 | 10,000,000 |
| 2018 | 2018-07-09 | 10,000,000 |
| 2018 | 2018-07-16 | 10,000,000 |
| 2018 | 2018-07-23 | 10,000,000 |
| 2018 | 2018-07-30 | 9,900,000 |
| 2018 | 2018-08-06 | 9,500,000 |
| 2018 | 2018-08-13 | 9,500,000 |
| 2018 | 2018-08-20 | 9,700,000 |
| 2018 | 2018-08-27 | 9,400,000 |
| 2018 | 2018-09-03 | 9,600,000 |
| 2018 | 2018-09-10 | 9,700,000 |
| 2018 | 2018-09-17 | 9,600,000 |
| 2018 | 2018-09-24 | 9,800,000 |
| 2018 | 2018-10-01 | 9,800,000 |
| 2018 | 2018-10-08 | 9,800,000 |
| 2018 | 2018-10-15 | 10,000,000 |
| 2018 | 2018-10-22 | 10,000,000 |
| 2018 | 2018-10-29 | 10,000,000 |
| 2018 | 2018-11-05 | 10,000,000 |
| 2018 | 2018-11-12 | 10,000,000 |
| 2018 | 2018-11-19 | 10,000,000 |
| 2018 | 2018-11-26 | 10,000,000 |
| 2018 | 2018-12-03 | 10,000,000 |
| 2018 | 2018-12-10 | 10,000,000 |
| 2018 | 2018-12-17 | 10,000,000 |
| 2018 | 2018-12-24 | 11,000,000 |
| 2019 | 2019-01-07 | 9,700,000 |
| 2019 | 2019-01-14 | 11,000,000 |
| 2019 | 2019-01-21 | 11,000,000 |
| 2019 | 2019-01-28 | 10,000,000 |
| 2019 | 2019-02-04 | 11,000,000 |
| 2019 | 2019-02-11 | 11,000,000 |
| 2019 | 2019-02-18 | 10,000,000 |
| 2019 | 2019-02-25 | 10,000,000 |
| 2019 | 2019-03-04 | 11,000,000 |
| 2019 | 2019-03-11 | 10,000,000 |
| 2019 | 2019-03-18 | 11,000,000 |
| 2019 | 2019-03-25 | 10,000,000 |
| 2019 | 2019-04-01 | 9,900,000 |
| 2019 | 2019-04-08 | 9,900,000 |
| 2019 | 2019-04-15 | 9,400,000 |
| 2019 | 2019-04-22 | 8,800,000 |
| 2019 | 2019-04-29 | 9,000,000 |
| 2019 | 2019-05-06 | 9,800,000 |
| 2019 | 2019-05-13 | 8,900,000 |
| 2019 | 2019-05-20 | 8,900,000 |
| 2019 | 2019-05-27 | 9,200,000 |
| 2019 | 2019-06-03 | 9,200,000 |
| 2019 | 2019-06-10 | 9,300,000 |
| 2019 | 2019-06-17 | 10,000,000 |
| 2019 | 2019-06-24 | 9,900,000 |
| 2019 | 2019-07-01 | 9,200,000 |
| 2019 | 2019-07-08 | 8,400,000 |
| 2019 | 2019-07-15 | 8,700,000 |
| 2019 | 2019-07-22 | 9,300,000 |
| 2019 | 2019-07-29 | 9,400,000 |
| 2019 | 2019-08-05 | 9,400,000 |
| 2019 | 2019-08-12 | 9,500,000 |
| 2019 | 2019-08-19 | 9,600,000 |
| 2019 | 2019-08-26 | 8,900,000 |
| 2019 | 2019-09-02 | 9,400,000 |
| 2019 | 2019-09-09 | 9,500,000 |
| 2019 | 2019-09-16 | 9,100,000 |
| 2019 | 2019-09-23 | 9,600,000 |
| 2019 | 2019-09-30 | 9,900,000 |
| 2019 | 2019-10-07 | 9,500,000 |
| 2019 | 2019-10-14 | 10,000,000 |
| 2019 | 2019-10-21 | 10,000,000 |
| 2019 | 2019-10-28 | 10,000,000 |
| 2019 | 2019-11-04 | 10,000,000 |
| 2019 | 2019-11-11 | 10,000,000 |
| 2019 | 2019-11-18 | 11,000,000 |
| 2019 | 2019-11-25 | 10,000,000 |
| 2019 | 2019-12-02 | 10,000,000 |
| 2019 | 2019-12-09 | 9,900,000 |
| 2019 | 2019-12-16 | 11,000,000 |
| 2019 | 2019-12-23 | 11,000,000 |
| 2020 | 2020-01-13 | 11,000,000 |
| 2020 | 2020-01-20 | 11,000,000 |
| 2020 | 2020-01-27 | 11,000,000 |
| 2020 | 2020-02-03 | 11,000,000 |
| 2020 | 2020-02-10 | 11,000,000 |
| 2020 | 2020-02-17 | 11,000,000 |
| 2020 | 2020-02-24 | 11,000,000 |
| 2020 | 2020-03-02 | 11,000,000 |
| 2020 | 2020-03-09 | 11,000,000 |
| 2020 | 2020-03-16 | 12,000,000 |
| 2020 | 2020-03-23 | 16,000,000 |
| 2020 | 2020-03-30 | 16,000,000 |
| 2020 | 2020-04-06 | 15,000,000 |
| 2020 | 2020-04-13 | 13,000,000 |
| 2020 | 2020-04-20 | 12,000,000 |
| 2020 | 2020-04-27 | 13,000,000 |
| 2020 | 2020-05-04 | 13,000,000 |
| 2020 | 2020-05-11 | 12,000,000 |
| 2020 | 2020-05-18 | 13,000,000 |
| 2020 | 2020-05-25 | 11,000,000 |
| 2020 | 2020-06-01 | 10,000,000 |
| 2020 | 2020-06-08 | 12,000,000 |
| 2020 | 2020-06-15 | 12,000,000 |
| 2020 | 2020-06-22 | 12,000,000 |
| 2020 | 2020-06-29 | 11,000,000 |
| 2020 | 2020-07-06 | 11,000,000 |
| 2020 | 2020-07-13 | 11,000,000 |
| 2020 | 2020-07-20 | 11,000,000 |
| 2020 | 2020-07-27 | 10,000,000 |
| 2020 | 2020-08-03 | 11,000,000 |
| 2020 | 2020-08-10 | 10,000,000 |
| 2020 | 2020-08-17 | 10,000,000 |
| 2020 | 2020-08-24 | 10,000,000 |
| 2020 | 2020-08-31 | 11,000,000 |
| 2020 | 2020-09-07 | 9,300,000 |
| 2020 | 2020-09-14 | 10,000,000 |
| 2020 | 2020-09-21 | 10,000,000 |
| 2020 | 2020-09-28 | 11,000,000 |
| 2020 | 2020-10-05 | 11,000,000 |
| 2020 | 2020-10-12 | 11,000,000 |
| 2020 | 2020-10-19 | 12,000,000 |
| 2020 | 2020-10-26 | 12,000,000 |
| 2020 | 2020-11-02 | 12,000,000 |
| 2020 | 2020-11-09 | 12,000,000 |
| 2020 | 2020-11-16 | 12,000,000 |
| 2020 | 2020-11-23 | 11,000,000 |
| 2020 | 2020-11-30 | 12,000,000 |
| 2020 | 2020-12-07 | 11,000,000 |
| 2020 | 2020-12-14 | 11,000,000 |
| 2020 | 2020-12-21 | 11,000,000 |
| 2021 | 2021-01-04 | 13,000,000 |
| 2021 | 2021-01-11 | 13,000,000 |
| 2021 | 2021-01-18 | 13,000,000 |
| 2021 | 2021-01-25 | 12,000,000 |
| 2021 | 2021-02-01 | 12,000,000 |
| 2021 | 2021-02-08 | 12,000,000 |
| 2021 | 2021-02-15 | 11,000,000 |
| 2021 | 2021-02-22 | 12,000,000 |
| 2021 | 2021-03-01 | 12,000,000 |
| 2021 | 2021-03-08 | 11,000,000 |
| 2021 | 2021-03-15 | 12,000,000 |
| 2021 | 2021-03-22 | 11,000,000 |
| 2021 | 2021-03-29 | 9,800,000 |
| 2021 | 2021-04-05 | 9,900,000 |
| 2021 | 2021-04-12 | 10,000,000 |
| 2021 | 2021-04-19 | 9,800,000 |
| 2021 | 2021-04-26 | 9,800,000 |
| 2021 | 2021-05-03 | 9,200,000 |
| 2021 | 2021-05-10 | 10,000,000 |
| 2021 | 2021-05-17 | 9,900,000 |
| 2021 | 2021-05-24 | 9,500,000 |
| 2021 | 2021-05-31 | 8,100,000 |
| 2021 | 2021-06-07 | 9,000,000 |
| 2021 | 2021-06-14 | 9,300,000 |
| 2021 | 2021-06-21 | 11,000,000 |
| 2021 | 2021-06-28 | 7,800,000 |
| 2021 | 2021-07-05 | 11,000,000 |
| 2021 | 2021-07-12 | 9,400,000 |
| 2021 | 2021-07-19 | 9,200,000 |
| 2021 | 2021-07-26 | 10,000,000 |
| 2021 | 2021-08-02 | 10,000,000 |
| 2021 | 2021-08-09 | 8,800,000 |
| 2021 | 2021-08-16 | 9,000,000 |
| 2021 | 2021-08-23 | 9,000,000 |
| 2021 | 2021-08-30 | 8,600,000 |
| 2021 | 2021-09-06 | 9,100,000 |
| 2021 | 2021-09-13 | 9,000,000 |
| 2021 | 2021-09-20 | 9,200,000 |
| 2021 | 2021-09-27 | 9,600,000 |
| 2021 | 2021-10-04 | 9,500,000 |
| 2021 | 2021-10-11 | 9,300,000 |
| 2021 | 2021-10-18 | 9,600,000 |
| 2021 | 2021-10-25 | 9,400,000 |
| 2021 | 2021-11-01 | 9,800,000 |
| 2021 | 2021-11-08 | 9,700,000 |
| 2021 | 2021-11-15 | 10,000,000 |
| 2021 | 2021-11-22 | 9,600,000 |
| 2021 | 2021-11-29 | 10,000,000 |
| 2021 | 2021-12-06 | 10,000,000 |
| 2021 | 2021-12-13 | 10,000,000 |
| 2021 | 2021-12-20 | 9,600,000 |
| 2022 | 2022-01-03 | 9,200,000 |
| 2022 | 2022-01-10 | 10,000,000 |
| 2022 | 2022-01-17 | 10,000,000 |
| 2022 | 2022-01-24 | 9,700,000 |
| 2022 | 2022-01-31 | 9,700,000 |
| 2022 | 2022-02-07 | 9,300,000 |
| 2022 | 2022-02-14 | 10,000,000 |
| 2022 | 2022-02-21 | 11,000,000 |
| 2022 | 2022-02-28 | 11,000,000 |
| 2022 | 2022-03-07 | 10,000,000 |
| 2022 | 2022-03-14 | 9,700,000 |
| 2022 | 2022-03-21 | 9,500,000 |
| 2022 | 2022-03-28 | 9,200,000 |
| 2022 | 2022-04-04 | 8,700,000 |
| 2022 | 2022-04-11 | 7,600,000 |
| 2022 | 2022-04-18 | 7,700,000 |
| 2022 | 2022-04-25 | 8,500,000 |
| 2022 | 2022-05-02 | 7,500,000 |
| 2022 | 2022-05-09 | 8,300,000 |
| 2022 | 2022-05-16 | 8,400,000 |
| 2022 | 2022-05-23 | 7,800,000 |
| 2022 | 2022-05-30 | 7,400,000 |
| 2022 | 2022-06-06 | 8,000,000 |
| 2022 | 2022-06-13 | 7,500,000 |
| 2022 | 2022-06-20 | 7,800,000 |
| 2022 | 2022-06-27 | 6,800,000 |
| 2022 | 2022-07-04 | 6,500,000 |
| 2022 | 2022-07-11 | 7,600,000 |
| 2022 | 2022-07-18 | 8,300,000 |
| 2022 | 2022-07-25 | 8,200,000 |
| 2022 | 2022-08-01 | 8,200,000 |
| 2022 | 2022-08-08 | 8,000,000 |
| 2022 | 2022-08-15 | 7,800,000 |
| 2022 | 2022-08-22 | 8,200,000 |
| 2022 | 2022-08-29 | 7,400,000 |
| 2022 | 2022-09-05 | 7,400,000 |
A linear model is the simplest method to forecast. The best fit for a linear trend line is created from the historical data and projected onto future dates. The linear model, at it’s most basic, is nothing more than a line of best fit.
From the graph showing TV viewing, a few trends can be observed.
Adding a simple linear trend to the graph produces the forecast
shown below in black where a downwards trend is given. This trend has no
seasonality and looks to be inflated by the higher than typical viewing
during Covid. A blue trend line was also added which uses historic data
from pre-Covid and shows a much steeper downwards trend.
The linear model so far is of the form y ~ x where y is
the number of viewers and x is the week commencing. However the trends
identified visually are not dependent on the factor ‘week commencing’
itself.
Plotted on the graph are the linear models giving:
viewers ~ year in red,viewers ~ quarter in greenviewers ~ week in purpleUsing the year gives the gradual decline over all we
would expect, whilst using the quarter or the
week shows a seasonal variation. However, the seasonal
trend identifiable shows lower viewing in the summer months and higher
viewing in the winter, but both the quarter and week linear model show
the highest point at new year and a gradual decline throughout the year
which is not correct.
One way to artificially create the seasonal trend is to add an
x3 term to the year based model. This would take the form
viewers ~ year + poly(week,3) and gives the desired
seasonal trend.
A final step would be to include some dummy data to show that Covid
is an inflated period: here the data is the Google mobility trends where
values during the Covid period are negative, to decrease traffic, and
outside of the Covid period they’re zero. This would be included in the
model in the form viewers ~ year + poly(week,3) + covid and
is plotted on the graph in red.
This shows the seasonal trend as desired, the gradual yearly trend, and understands that the rise from Covid is not an indicator of future viewer numbers.
However, using the linear model in this way relied on a visual interpretation of the trend, particularly the seasonal one, rather than a statistical look at the data and a forecast model based upon this.
## year quarter year_quarter week_commencing week covid viewers
## 1 2022 3 2022_q3 2022-09-12 37 0 8360252
## 2 2022 3 2022_q3 2022-09-19 38 0 8402554
## 3 2022 3 2022_q3 2022-09-26 39 0 8449077
## 4 2022 4 2022_q4 2022-10-03 40 0 8499753
## 5 2022 4 2022_q4 2022-10-10 41 0 8554516
## 6 2022 4 2022_q4 2022-10-17 42 0 8613300
| year | average weekly viewers | percentage decrease from 2017 |
|---|---|---|
| 2017 | 11,000,000 | 0.0 |
| 2018 | 10,000,000 | -5.5 |
| 2019 | 9,900,000 | -9.1 |
| 2020 | 11,000,000 | 5.6 |
| 2021 | 10,000,000 | -6.8 |
| 2022 | 8,600,000 | -21.0 |
| 2022 | 8,800,000 | -19.0 |
| 2023 | 8,300,000 | -23.0 |
| 2024 | 7,900,000 | -27.0 |
| 2025 | 7,500,000 | -31.0 |
| 2026 | 7,100,000 | -35.0 |
| 2027 | 6,700,000 | -39.0 |
| 2028 | 6,200,000 | -43.0 |
| 2029 | 5,800,000 | -46.0 |
| 2030 | 5,400,000 | -50.0 |
Generally, any time series can be written as a series of three components: a trend, a seasonal component and a random component.
Auto-Regressive Integrated Moving Average is a more complex forecasting technique that looks at auto-correlations and moving averages, comparing the data at one period in time to the previous period.
The ARIMA model is mathematically of the form
\(\begin{aligned} y_t = & \delta + \epsilon_t + \\ & \phi_1 y_{t-1} + ... \phi_{1} y_{t-p} + \\ & \theta_1 \epsilon_{t-1} + ... \theta_{1} \epsilon_{t-q} \end{aligned}\)
The first terms \(\delta\) and \(\epsilon_{t}\) are essentially a white noise component with \(\delta\) a constant and \(\epsilon_{t}\) an error term for the given time which shows the deviation from the constant \(\delta\).
The terms given on the next line make the auto-regressive part of the
formula where the components dependent on previous values of \(y\) at given times \(t\) multiplied by a scale factor \(\phi\). The previous \(p\) number of terms are used. For example
if p = 2 then the value of y at the previous time \(y_{t-1}\) and the previous time \(y_{t-2}\) would be included.
The terms on the bottom line make the moving average part of the formula, where the difference between the previous value of y (written as \(y_{t-1}\)) and the moving average is given as \(\epsilon_{t-1}\) multiplied by a scale factor \(\theta\). Similar to the auto-regressive part, the previous \(q\) terms are used.
The ARIMA model is written in R in the form:
forecast::Arima(
time_series,
order = c(p,d,q),
seasonal = list(order = c(P,D,Q), period = n),
include.drift = TRUE
)
where the terms given are (in brief)
With the linear model it was simple to add in the Covid factor as a
variable. The ARIMA process doesn’t have this option so the initial data
set will be scaled based on the covid factor make this example process
simpler . This is show below with the red points the true data for the
Covid period and the blue points the scaled values. The data pre and
post Covid remains the same.
The TV data is very simply converted to a time series.
library(tseries);library(zoo);library(forecast)
## the data frame
tv %>% head()
## region year quarter year_quarter week week_commencing viewers covid
## 1 England 2017 1 2017_q1 1 2017-01-02 11308900 0
## 2 England 2017 1 2017_q1 2 2017-01-09 12659000 0
## 3 England 2017 1 2017_q1 3 2017-01-16 12093200 0
## 4 England 2017 1 2017_q1 4 2017-01-23 12086500 0
## 5 England 2017 1 2017_q1 5 2017-01-30 11540600 0
## 6 England 2017 1 2017_q1 6 2017-02-06 11328900 0
# Convert to time series
data_ts<-ts(tv$viewers,
freq= 365.25/7, #to get weeks in a year
start=decimal_date(tv$week_commencing %>% min())
)
data_ts %>% head()
## Time Series:
## Start = 2017.00273972603
## End = 2017.09856450358
## Frequency = 52.1785714285714
## [1] 11308900 12659000 12093200 12086500 11540600 11328900
## plot the data to show it's still of the same shape
plot(data_ts)
As discussed in the linear model section, a long term trend and a seasonal trend can both be identified visually but, whereas the linear model approximated relationships and added a simple x3 term to make the season trend appear correct, the ARIMA model uses statistical techniques to determine what trends are truly in the data.
Decomposing the series is a good way to gain an initial understanding.
plot(decompose(data_ts))
The trend section of the data shows the general decline but shows a leveling off during Covid. The season trend is clearly identified and shows the trend we know to be true of a rise in the winter months and a drop in the summer. The final section was automatically identified as being random fluctuations.
A stationary series is one where the observed value does not depend on the time and neither do the statistical properties such as mean or variance.
In a stationary series visually you would be able to see fluctuation from day to day, but no seasonality and no trend over time. To check this statistically the Augmented Dickey Fuller (ADF) test is used. If the p-value is less than 0.05 there is 95% confidence in the null hypothesis and the series is stationary.
## test for stationarity
adf.test(data_ts)
##
## Augmented Dickey-Fuller Test
##
## data: data_ts
## Dickey-Fuller = -3.5967, Lag order = 6, p-value = 0.03384
## alternative hypothesis: stationary
The p-value is below 0.05 suggesting the data is stationary, but visual observations very much suggest it isn’t.
The most common way to make a series stationary is to take the difference between one observation and the next, then test again for stationarity. If that series is not stationary, the difference may be taken again. The log of the values is also often taken to reduce the effect of extreme variation in the size of values.
The function ndiffs() gives a suggested number of time
the differencing is done.
## What is the suggested differencing?
ndiffs(data_ts)
## [1] 1
## What does this look like?
plot(data_ts %>% log %>% diff())
abline(h=0, col="red")
## Test for stationarity
adf.test(data_ts %>%log %>% diff())
##
## Augmented Dickey-Fuller Test
##
## data: data_ts %>% log %>% diff()
## Dickey-Fuller = -7.7555, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
The p-value is now lower than before, and lower than the 0.05
threshold so the null hypothesis can be accepted and the data series
deemed stationary. As the data was differenced once to achieve this
stationarity the ARIMA order value d takes the value
one.
Visual observations and the decomposition plot clearly show seasonality and cyclicity so the auto-regressive and moving average parts will need be be identified.
To determine the values for the order terms p and
q an auto-correlation plot (ACF) and a partial
auto-correlation plot (PACF) are plotted. The ACF is created by looking
at the correlation of a term with the term once prior, then the term
twice prior, then three times and so on. Below is an example of how this
is done manually and with one simple line of code.
## create the differenced data to mimic the differenced time series data
x<-tv %>% select(viewers) %>% mutate(diff = log(viewers) - log(lag(viewers)))%>% tail(-1)
head(x)
## viewers diff
## 2 12659000 0.1127783983
## 3 12093200 -0.0457251135
## 4 12086500 -0.0005541839
## 5 11540600 -0.0462178745
## 6 11328900 -0.0185142698
## 7 11477600 0.0130403269
## create three data sets with the actual data, one lag, and a second lag
x_0 <- x$diff %>% head(-1)
x_1 <- x$diff %>% tail(-1)
x_2 <- x$diff %>% tail(-2)
## to illustrate the lags
cbind(x_0,x_1,x_2) %>% head()
## x_0 x_1 x_2
## [1,] 0.1127783983 -0.0457251135 -0.0005541839
## [2,] -0.0457251135 -0.0005541839 -0.0462178745
## [3,] -0.0005541839 -0.0462178745 -0.0185142698
## [4,] -0.0462178745 -0.0185142698 0.0130403269
## [5,] -0.0185142698 0.0130403269 0.0340801445
## [6,] 0.0130403269 0.0340801445 -0.0303841185
## look at how correlated one lag is
plot(x_0, x_1)
abline(reg=lm(x_0 ~ x_1))
text(0.2,0.3, paste0('cor = ',round(cor(x_0,x_1),2)))
## look at how correlated two lag is
plot(x_0 %>% head(-1), x_2)
abline(reg=lm(x_0%>% head(-1) ~ x_2))
text(0.2,0.3, paste0('cor = ',round(cor(x_0 %>% head(-1),x_2),2)))
The correlations for the first two logs that were done manually are given below. Comparatively, the ACF plot finds the correlations for many lags and plots them.
## the correlation coeficients for lags 1 and 2
print(signif(cor(x_0,x_1),2) )
## [1] -0.41
print(signif(cor(x_0 %>% head(-1),x_2),2))
## [1] 0.0072
## the ACF will find each coefficient and then plot the results.
acf(x$diff, plot = FALSE) %>% head(n=6)
##
## Autocorrelations of series 'x$diff', by lag
##
## 1 2 3 4 5 6
## -0.407 0.007 -0.058 0.018 -0.038 0.111
acf(x$diff, plot = TRUE, lag = 120 )
# Using the time series gives exactly the same result
#acf(data_ts %>%log %>% diff())
The correlation with lag one data is significant as it falls outside
the threshold shown by the blue lines. This means the order value would
be q = 1.
Within this plot there are two peaks falling just outside the
threshold around lag 50 and lag 100 suggesting an annual correlation.
This suggests that there is a seasonal component for the model so
Q = 1 would be a good starting point.
The second technique uses a partial auto-correlation technique where
the correlation is considered but any related confounding variables are
removed. The PACF function is given below, and as with the ACF, as the
lag 1 term is above the significance threshold the order factor is
determined to be p = 1.
pacf(x$diff, plot = FALSE) %>% head(n=6)
##
## Partial autocorrelations of series 'x$diff', by lag
##
## 1 2 3 4 5 6
## -0.407 -0.190 -0.164 -0.099 -0.109 0.050
pacf(x$diff, lag = 120 )
#pacf(data_ts %>%log %>% diff()) equivalent for the time series data
Here there seems to be another spike at around 100 or two years, suggesting there may be a seasonal P component, but as there is not a similar peak around week 52 this is less clear.
From the test for stationarity, the ACF, and the PACF the order values for the ARIMA were determined:
p = 1d = 1q = 1An ARIMA model can then be used to forecast the data and provides a series of coefficients. The seasonal components P,D,Q are harder to identify. The ACF and PCF suggested seasonal components should be included. Initially use the same values as the order terms, with the period 52 weeks.
fit <- forecast::Arima(data_ts %>% log(),
order = c(1,1,1),
seasonal = list(order = c(1,1,1), period = 52),
#include.drift = TRUE,
method = 'CSS'
)
fit
## Series: data_ts %>% log()
## ARIMA(1,1,1)(1,1,1)[52]
##
## Coefficients:
## ar1 ma1 sar1 sma1
## 0.2551 -0.829 -0.3403 -0.4160
## s.e. 0.0862 0.049 0.0732 0.0946
##
## sigma^2 = 0.00463: log likelihood = 268.85
The general equation is of the form
\(\begin{aligned} y_t = & \delta + \epsilon_t + \\ & \phi_1 y_{t-1} + ... \phi_{1} y_{t-p} + \\ & \theta_1 \epsilon_{t-1} + ... \theta_{1} \epsilon_{t-q} \end{aligned}\)
With the coefficients given this can be re-written as
\(\begin{aligned} y_t -y_{t-1} = & \delta + \epsilon_t + \\ & 0.2551 y_{t-1} + ... \phi_{1} y_{t-p} + \\ & -0.829 \epsilon_{t-1} + ... \theta_{1} \epsilon_{t-q} \end{aligned}\)
The \(y_t -y_{t-1}\) term is because the differenced data is being used and the coefficients are given by the model.
Once the moving average and autocorrelation parts are calculated, all that should remain is the white noise component.
The seasonal MA and AR components mathematically look the same, but have different values for the co-efficient and use the lag of the previous seasonal value. For example \(\phi_1 y_{t-1}\) for seasonal with period 52 weeks would take value \(-0.3403 y_{t-52}\).
The residuals from the fit should appear as white noise.
checkresiduals(fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1)(1,1,1)[52]
## Q* = 61.113, df = 53, p-value = 0.2075
##
## Model df: 4. Total lags used: 57
Plotting the residuals shows a stationary series with no trend. The residuals are normally distributed around a mean of one. The autocorrelation of the residuals shows a significant peak around lag 50. This suggests the seasonal term value P,D,Q may need altering.
pred <- predict(fit, n.ahead = 52*9)
ts.plot(data_ts , 2.718^pred$pred, log = "y", lty = c(1,3), xlab="time",ylab = "viewers (mil)")
The prediction into the future looks fairly good, but as the ACF residual has a significant value it would be worth comparing to a slightly different value.
Using seasonal components P,Q,D of 1,1,0
and 0,1,1 does not change the residual in the ACF that is
significant, but using 1,0,1 does remove it however, as
show below, that change produces no usable forecast.
fit2 <- forecast::Arima(data_ts %>% log(),
order = c(1,1,1),
seasonal = list(order = c(1,0,1), period = 52),
#include.drift = TRUE,
method = 'CSS'
)
fit2
## Series: data_ts %>% log()
## ARIMA(1,1,1)(1,0,1)[52]
##
## Coefficients:
## ar1 ma1 sar1 sma1
## 0.1257 -0.6632 0.0536 0.0553
## s.e. 0.0976 0.0738 0.1342 0.1437
##
## sigma^2 = 0.00374: log likelihood = 366.09
checkresiduals(fit2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1)(1,0,1)[52]
## Q* = 53.122, df = 53, p-value = 0.4695
##
## Model df: 4. Total lags used: 57
pred2 <- predict(fit2, n.ahead = 52*9)
ts.plot(data_ts , 2.718^pred2$pred, log = "y", lty = c(1,3), xlab="time",ylab = "viewers (mil)")
A useful technique is to use the auto.arima function as
a starting point for deciding upon the components p,d,q and
P,D,Q.
For this case, the auto arima function gives
ARIMA(0,1,1)(0,0,1)[52] which is quite different to that
manually decided upon from observing the plots. The auto arima residuals
are very good, showing white noise behaviour, but the forecast is not
usable.
auto_fit <- forecast::auto.arima(data_ts %>% log())
auto_fit
## Series: data_ts %>% log()
## ARIMA(0,1,1)(0,0,1)[52]
##
## Coefficients:
## ma1 sma1
## -0.5880 0.0994
## s.e. 0.0526 0.0666
##
## sigma^2 = 0.004408: log likelihood = 370.42
## AIC=-734.84 AICc=-734.75 BIC=-723.87
checkresiduals(auto_fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(0,0,1)[52]
## Q* = 45.524, df = 55, p-value = 0.8151
##
## Model df: 2. Total lags used: 57
auto_pred <- predict(auto_fit, n.ahead = 52*9)
ts.plot(data_ts , 2.718^auto_pred$pred, log = "y", lty = c(1,3), xlab="time",ylab = "viewers (mil)")
In summary, the ARIMA model appears to be more scientific than the
linear model as it tests for stationarity and separates out each
component to be dealt with individually. Choosing the components
p,d,q and P,D,Q can be determined from the ACF
and PACF plots, but ultimately a model needs to be built to determine if
the forecast is useful.
The linear model appears less thorough than the ARIMA model and does not split out the seasonal and moving average components. But the linear model can account for many factors such as the need to scale some data due to Covid. The ARIMA model requires the data to be pre-processed to account for this.
Both the ARIMA and linear model forecasts show the same seasonal trends but the ARIMA shows a steeper long term decline. This means that by 2030 the linear model is forecasting a 50% reduction in traffic from 2017 whilst the ARIMA is predicting a 61% reduction. Visually, the shape of the ARIMA model appears to better reflect the seasonal trend in the data and the line of best fit appears to better reflect the previous data
notes: project line of best fit backwards put on true data without covid being accounted for
try for other data sets
| Year | Linear Model | Linear % Change | Arima | Arima % Change |
|---|---|---|---|---|
| 2017 | 11,000,000 | 0% | 11,000,000 | 0% |
| 2018 | 10,000,000 | -5.5% | 10,000,000 | -5.5% |
| 2019 | 9,900,000 | -9.1% | 9,900,000 | -9.1% |
| 2020 | 9,900,000 | -9.3% | 9,900,000 | -9.3% |
| 2021 | 9,200,000 | -16% | 9,200,000 | -16% |
| 2022 | 8,600,000 | -21% | 8,400,000 | -23% |
| 2023 | 8,300,000 | -23% | 7,600,000 | -30% |
| 2024 | 7,900,000 | -27% | 7,000,000 | -35% |
| 2025 | 7,500,000 | -31% | 6,400,000 | -41% |
| 2026 | 7,100,000 | -35% | 5,900,000 | -46% |
| 2027 | 6,700,000 | -39% | 5,400,000 | -50% |
| 2028 | 6,200,000 | -43% | 5,000,000 | -54% |
| 2029 | 5,800,000 | -46% | 4,600,000 | -58% |
| 2030 | 5,400,000 | -50% | 4,200,000 | -61% |
– talk about the effect changing to 101 has with athe ACF plot and the forecast – plot the graph nicely in ggplot – add a block that’s auto-arima – add a summary
## log_data previous_y diff new_diff new_y fitted missing
## 1 16.24110 NA NA NA NA 16.24110 NA
## 2 16.35388 16.24110 0.1127783983 0.9496783 17.19078 16.35388 0.8368999
## 3 16.30815 16.35388 -0.0457251135 1.0204025 17.37428 16.30815 1.0661277
## 4 16.30760 16.30815 -0.0005541839 1.0002473 17.30840 16.30760 1.0008015
## 5 16.26138 16.30760 -0.0462178745 1.0206224 17.32822 16.26138 1.0668403
## 6 16.24287 16.26138 -0.0185142698 1.0082611 17.26964 16.24287 1.0267753
## x fitted residuals diff
## 1 16.24110 16.24110 0 0
## 2 16.35388 16.35388 0 0
## 3 16.30815 16.30815 0 0
## 4 16.30760 16.30760 0 0
## 5 16.26138 16.26138 0 0
## 6 16.24287 16.24287 0 0
\[ y_t = \delta + 0.0633 y_{t-1} \notag\\ \epsilon + -0.5094 y_{t-1} \]
An example of a stationary series is the change in Google stock price from one day to the next. The graph appears visually stationary about the mean of zero, but the Augmented Dickey-Fuller Test can be used to statistically test for stationarity.
The ADF test has a null hypothesis that there is a time
dependent relationship and an alternative hypothesis that there is not,
i.e the series is stationary. For this data the p-value is 0.01 which is
below the 0.05 (5%) value showing that the result is statistically valid
and the null hypotesis should be rejected and the alternative accepted.
The data is stationary.